feat(BA-3435): Implement Rolling Update deployment strategy by jopemachine · Pull Request #9997 · lablup/backend.ai

jopemachine · 2026-03-12T09:44:04Z

Overview

Implements the Rolling Update deployment strategy FSM (BEP-1049) — a pure-function evaluator that gradually replaces old-revision routes with new-revision routes, respecting surge and unavailability budgets.

Also refactors DEPLOYING timeout handling: removes the separate CHECK_DEPLOYING_TIMEOUT lifecycle type and DeployingTimeoutHandler, replacing them with the standard expired transition mechanism on DeployingProvisioningHandler. Timeout is now checked via phase_started_at from scheduling history (which is not reset on retries due to history merge), eliminating the need for the deploying_started_at column.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│  rolling_update.evaluate_cycle()          (strategy/rolling_update.py)      │
│                                                                             │
│  Pure function: (DeploymentInfo, routes, RollingUpdateSpec) → CycleResult   │
│                                                                             │
│  FSM:                                                                       │
│    1. Classify routes by revision_id:                                       │
│         old_active:       revision != deploying_revision, is_active()       │
│         new_provisioning: revision == deploying_revision, PROVISIONING      │
│         new_healthy:      revision == deploying_revision, HEALTHY           │
│         new_unhealthy:    revision == deploying_revision, UNHEALTHY/DEGRADED│
│         new_failed:       revision == deploying_revision, FAILED/TERMINATED │
│                                                                             │
│    2. new_provisioning? ───────────────────→ PROVISIONING (wait)            │
│    3. no old + new_healthy >= desired? ────→ COMPLETED                      │
│    4. all new failed/unhealthy? ──────────→ ROLLED_BACK                     │
│    5. Compute surge/unavailable budget:                                     │
│         max_total     = desired + max_surge                                 │
│         min_available = desired - max_unavailable                           │
│         to_create     = min(max_total - current, desired - new_live)        │
│         to_terminate  = min(available - min_available, old_active)          │
│       ─────────────────────────────────────→ PROGRESSING                    │
│                                              + RouteChanges(rollout_specs,  │
│                                                             drain_route_ids)│
└─────────────────────────────────────────────────────────────────────────────┘
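
The budget arithmetic in step 5 can be sketched as follows. This is a minimal illustration rather than the PR's implementation: `compute_budget`, `Budget`, and the flat integer parameters are hypothetical names. `current` is assumed to mean old active routes plus live new routes, `available` is assumed to mean healthy new routes plus old active routes, and clamping at zero is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    to_create: int
    to_terminate: int


def compute_budget(
    desired: int,
    max_surge: int,
    max_unavailable: int,
    old_active: int,
    new_healthy: int,
    new_live: int,
) -> Budget:
    """Hypothetical helper mirroring step 5 of the diagram above."""
    max_total = desired + max_surge
    min_available = desired - max_unavailable
    current = old_active + new_live        # assumption: every live route counts
    available = new_healthy + old_active   # assumption: old active routes count as available
    to_create = min(max_total - current, desired - new_live)
    to_terminate = min(available - min_available, old_active)
    # Clamp at zero (assumption) so an exhausted budget never yields a negative count.
    return Budget(max(to_create, 0), max(to_terminate, 0))


# Cycle 0 of the example below: 3 old routes, no new routes yet.
print(compute_budget(3, 1, 1, old_active=3, new_healthy=0, new_live=0))
# Budget(to_create=1, to_terminate=1)
```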

Cycle-by-Cycle Example (`desired=3, max_surge=1, max_unavailable=1`)

Cycle 0 (initial)          Cycle 1 (provisioning)     Cycle 2 (1 new healthy)
Old: [■ ■ ■]               Old: [■ ■]                 Old: [■ ■]
New: []                     New: [◇]                   New: [■]
→ create 1, terminate 1    → wait (PROVISIONING)      → create 1, terminate 1
        │                          │                          │
        ▼                          ▼                          ▼
Cycle 3 (provisioning)     Cycle 4 (2 new healthy)    Cycle 5 (provisioning)
Old: [■]                    Old: [■]                   Old: []
New: [■ ◇]                 New: [■ ■]                 New: [■ ■ ◇]
→ wait (PROVISIONING)      → create 1, terminate 1    → wait (PROVISIONING)
                                   │                          │
                                   ▼                          ▼
                            Cycle 6 (completed)
                            Old: []
                            New: [■ ■ ■]
                            → COMPLETED — revision swap + DEPLOYING → READY

Legend: ■ = healthy, ◇ = provisioning
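
The cycle table above can be reproduced with a small simulation. This is a sketch under stated assumptions: `simulate` is a hypothetical function, every provisioning route is assumed to become healthy before the next cycle, old routes are assumed to stay healthy, and the `STALLED` label is an illustrative guard that is not part of the PR.

```python
def simulate(desired: int, max_surge: int, max_unavailable: int) -> list[str]:
    """Return the sub-step reported at each cycle of a simplified rollout."""
    old, new_healthy, new_prov = desired, 0, 0
    trace: list[str] = []
    for _ in range(100):  # safety bound for the sketch
        if new_prov > 0:
            # Step 2: wait while any new route is still provisioning.
            trace.append("PROVISIONING")
            new_healthy += new_prov  # assumption: provisioning always succeeds
            new_prov = 0
            continue
        if old == 0 and new_healthy >= desired:
            # Step 3: nothing old remains and enough new routes are healthy.
            trace.append("COMPLETED")
            break
        # Step 5: surge / unavailability budget (new_live == new_healthy here).
        current = old + new_healthy
        to_create = max(min(desired + max_surge - current, desired - new_healthy), 0)
        available = new_healthy + old
        to_terminate = max(min(available - (desired - max_unavailable), old), 0)
        if to_create == 0 and to_terminate == 0:
            trace.append("STALLED")  # illustrative only; both budgets are exhausted
            break
        trace.append("PROGRESSING")
        new_prov += to_create
        old -= to_terminate
    return trace


print(simulate(desired=3, max_surge=1, max_unavailable=1))
```

With `desired=3, max_surge=1, max_unavailable=1` the trace alternates PROGRESSING and PROVISIONING for cycles 0 through 5 and ends with COMPLETED at cycle 6, matching the table.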

Safety Guards

- Zero-downtime protection: When max_unavailable < desired, the FSM never terminates all old routes until at least one new route is healthy.
- Deadlock prevention: The RollingUpdateSpec validator ensures at least one of max_surge or max_unavailable is positive.
- Rollback detection: If all new routes are FAILED_TO_START or UNHEALTHY (none healthy, none provisioning), the FSM returns ROLLED_BACK.
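
The deadlock-prevention rule can be illustrated with a small validator. This is a sketch only: `RollingUpdateSpecSketch` is a hypothetical stand-in written as a plain dataclass, and the real `RollingUpdateSpec` in `models/deployment_policy/row.py` may be built differently. The non-negativity check is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpecSketch:
    """Hypothetical stand-in for RollingUpdateSpec (not the PR's class)."""

    max_surge: int
    max_unavailable: int

    def __post_init__(self) -> None:
        # Assumption: negative budgets are rejected as well.
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        # With both at zero the FSM could neither create nor terminate a route.
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError(
                "at least one of max_surge or max_unavailable must be positive"
            )


RollingUpdateSpecSketch(max_surge=1, max_unavailable=0)  # accepted
```

Rejecting the configuration at construction time keeps the evaluator free of a runtime stall check.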

Deploying Timeout Refactor

Previously, deploying timeout was handled by a separate DeployingTimeoutHandler registered under CHECK_DEPLOYING_TIMEOUT lifecycle type, running as an independent periodic task. This has been unified with the standard expired transition mechanism:

- DeployingProvisioningHandler now declares an expired transition (→ DEPLOYING/ROLLING_BACK)
- The coordinator checks skipped deployments for timeout using phase_started_at from scheduling history
- phase_started_at is stable across retries (history records are merged via should_merge_with, incrementing attempts without changing created_at)
- The deploying_started_at column and its migration have been removed entirely
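
Why a merged history record gives a stable timeout anchor can be sketched as below. All names here are hypothetical except `should_merge_with`, `attempts`, and `created_at`, which the description mentions. The merge rule (same phase) and the `>=` comparison are assumptions for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class HistoryRecord:
    """Hypothetical scheduling-history row."""

    phase: str
    created_at: datetime
    attempts: int = 1

    def should_merge_with(self, other: "HistoryRecord") -> bool:
        # Assumption: consecutive records for the same phase are merged.
        return self.phase == other.phase


def record_attempt(
    latest: Optional[HistoryRecord], phase: str, now: datetime
) -> HistoryRecord:
    candidate = HistoryRecord(phase=phase, created_at=now)
    if latest is not None and latest.should_merge_with(candidate):
        # A retry bumps attempts but keeps the original created_at.
        return replace(latest, attempts=latest.attempts + 1)
    return candidate


def is_expired(phase_started_at: datetime, now: datetime, timeout: timedelta) -> bool:
    return now - phase_started_at >= timeout


t0 = datetime(2026, 3, 12, 9, 0, tzinfo=timezone.utc)
first = record_attempt(None, "DEPLOYING_PROVISIONING", t0)
retry = record_attempt(first, "DEPLOYING_PROVISIONING", t0 + timedelta(minutes=10))
print(retry.attempts, retry.created_at == t0)  # 2 True
```

Because the retry does not reset `created_at`, the timeout is measured from the first attempt of the phase, which is what makes a separate `deploying_started_at` column unnecessary.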

Key Types

Type	Location	Purpose
`StrategyCycleResult`	`strategy/types.py`	Single deployment FSM result: sub_step + route_changes
`RouteChanges`	`strategy/types.py`	Route mutations: rollout_specs (Creator) + drain_route_ids
`RollingUpdateSpec`	`models/deployment_policy/row.py`	Config: max_surge, max_unavailable
`AbstractDeploymentStrategy`	`strategy/types.py`	Strategy interface that `RollingUpdateStrategy` implements

Changed Files

File	Change
`strategy/rolling_update.py`	Rolling update FSM implementation (stub → full)
`handlers/deploying.py`	Remove `DeployingTimeoutHandler`, add `expired` transition to provisioning handler
`coordinator.py`	Remove `CHECK_DEPLOYING_TIMEOUT` lifecycle, add skipped-timeout check, use `phase_started_at` uniformly
`types.py`	Remove `CHECK_DEPLOYING_TIMEOUT` enum member
`data/deployment/types.py`	Remove `deploying_started_at` field
`models/endpoint/row.py`	Remove `deploying_started_at` column
`test_rolling_update.py`	54 unit tests across 18 test classes covering all FSM branches

Test Coverage

Test Class	Scenarios
`TestBasicFSMStates`	PROVISIONING, COMPLETED, ROLLED_BACK, PROGRESSING
`TestMaxSurge`	Surge budget limits, surge=0
`TestMaxUnavailable`	Unavailability budget, unavailable=0
`TestCombinedSurgeAndUnavailable`	Both parameters active
`TestMultiCycleProgression`	Multi-step rollout sequences
`TestMixedRouteStatuses`	UNHEALTHY + HEALTHY mixed states
`TestTerminationPriority`	Old route termination ordering
`TestEdgeCases`	Empty routes, desired=0, no deploying revision
`TestRouteCreatorSpecs`	Creator spec correctness
`TestRealisticScenario`	Full 3-replica rolling update simulation
`TestDeadlockAndStall`	surge=0/unavailable=0 deadlock prevention
`TestDesiredReplicaCount`	Various replica counts (1, 5, 10)
`TestScaleDownDuringRollingUpdate`	Scale-down during active rollout
`TestConcurrentOperations`	Multiple revision edge cases

Milestone metadata specifying the target backport version
Mention to the original issue
Installer updates including:
- Fixtures for db schema changes
- New mandatory config options
Update of end-to-end CLI integration tests in ai.backend.test
API server-client counterparts (e.g., manager API -> client SDK)
Test case(s) to:
- Demonstrate the difference of before/after
- Demonstrate the flow of abstract/conceptual models with a concrete implementation
Documentation
- Contents in the docs directory
- docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--9997.org.readthedocs.build/en/9997/

📚 Documentation preview 📚: https://sorna-ko--9997.org.readthedocs.build/ko/9997/

Copilot

Pull request overview

This PR adds a Rolling Update deployment strategy evaluator (pure-function FSM) and refactors DEPLOYING timeout handling to use the coordinator’s standard expired transition mechanism, while also renaming/standardizing deployment sub-step typing and exposing the sub-step via the deployment API.

Changes:

- Implement RollingUpdateStrategy.evaluate_cycle() with surge/unavailability budgeting and route mutation outputs.
- Replace/standardize deployment sub-step handling with DeploymentLifecycleSubStep across coordinator/handlers/repos and add skipped-timeout checks that drive expired → DEPLOYING/ROLLING_BACK.
- Add fallback revision-spec loading from endpoint-level fields and surface sub_step in deployment DTO/API.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

File	Description
tests/unit/manager/sokovan/deployment/strategy/test_rolling_update.py	New unit tests covering rolling update FSM outcomes and budgeting.
tests/unit/manager/sokovan/deployment/strategy/test_applier.py	Update tests for new sub-step enum and completed detection; adjust fixtures.
tests/unit/manager/sokovan/deployment/executor/conftest.py	Extend repo mock to support new `get_revision_spec_from_endpoint()` path.
src/ai/backend/manager/sokovan/deployment/strategy/types.py	Update strategy result types to use `DeploymentLifecycleSubStep`.
src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py	Implement rolling update route classification + create/drain decisions.
src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py	Adjust bulk route fetching conditions (incl. terminated new-revision routes).
src/ai/backend/manager/sokovan/deployment/strategy/applier.py	Update completed detection to `DEPLOYING_COMPLETED`; simplify applier surface.
src/ai/backend/manager/sokovan/deployment/handlers/deploying.py	Add `expired` transition for provisioning; refactor rolling-back cleanup to repo.
src/ai/backend/manager/sokovan/deployment/handlers/base.py	Docstring alignment to new sub-step naming.
src/ai/backend/manager/sokovan/deployment/executor.py	Use endpoint-level revision spec fallback when no current revision exists.
src/ai/backend/manager/sokovan/deployment/deployment_controller.py	Update controller API to accept `DeploymentLifecycleSubStep`.
src/ai/backend/manager/sokovan/deployment/coordinator.py	Wire sub-step filtering, add skipped-timeout expiration handling, update task specs.
src/ai/backend/manager/services/deployment/service.py	Include `sub_step` in deployment data conversion; update lifecycle marking callsites.
src/ai/backend/manager/repositories/deployment/repository.py	Add `sub_steps` filtering to handler fetch; add `get_revision_spec_from_endpoint()`.
src/ai/backend/manager/repositories/deployment/db_source/db_source.py	Implement sub-step filtering + endpoint-based revision spec builder query.
src/ai/backend/manager/repositories/deployment/creators/deployment.py	Update lifecycle batch updater spec to use `DeploymentLifecycleSubStep`.
src/ai/backend/manager/models/endpoint/row.py	Switch sub-step column type + add `build_revision_spec_from_endpoint()` helper.
src/ai/backend/manager/event_dispatcher/handlers/schedule.py	Decode deployment sub-step using new enum type.
src/ai/backend/manager/data/deployment/types.py	Introduce `DeploymentLifecycleSubStep`; add `RouteStatus.is_provisioning()`.
src/ai/backend/manager/api/rest/deployment/adapter.py	Map `sub_step` into REST DTO conversion.
src/ai/backend/common/dto/manager/deployment/response.py	Add `sub_step` field to deployment response DTO.
proposals/BEP-1049/rolling-update.md	Update BEP doc to match new handler/timeout flow and FSM semantics.
proposals/BEP-1049-deployment-strategy-handler.md	Update design doc to reflect 2 DEPLOYING handlers + skipped-timeout expiry behavior.
changes/9997.feature.md	Changelog entry for rolling update strategy.

Copilot · 2026-03-20T06:07:45Z

+          new_healthy=2, old=3 → can_terminate=2, old=3 → 2 (budget-limited)
+          new_healthy=4, old=1 → can_terminate=2, old=1 → 1 (old-count-limited)
+        """
+        available_count = classified.new_healthy_count + len(classified.old_active)


_compute_routes_to_terminate() treats every old_active route as "available" by using len(classified.old_active), but old_active includes PROVISIONING/UNHEALTHY/DEGRADED routes because RouteStatus.is_active() returns true for them. This can overestimate available_count and allow terminating additional old routes beyond the min_available budget (potential downtime). Consider tracking/counting healthy old routes separately (or filtering old_active by status == HEALTHY) when computing available_count and can_terminate, while still keeping the full old_active list for termination ordering.

Suggested change

-        available_count = classified.new_healthy_count + len(classified.old_active)
+        old_healthy_count = sum(
+            1 for route in classified.old_active if route.status == RouteStatus.HEALTHY
+        )
+        available_count = classified.new_healthy_count + old_healthy_count
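
The overestimate described in this comment can be shown with concrete numbers. This is a sketch, not the PR's `_compute_routes_to_terminate()`: `can_terminate` and its parameters are hypothetical, `RouteStatus` is a reduced stand-in enum, and clamping at zero is an assumption.

```python
from enum import Enum


class RouteStatus(Enum):  # reduced stand-in for the real enum
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    PROVISIONING = "provisioning"


def can_terminate(
    old_statuses: list[RouteStatus],
    new_healthy: int,
    desired: int,
    max_unavailable: int,
    healthy_only: bool,
) -> int:
    """Hypothetical helper comparing the two ways of counting availability."""
    if healthy_only:
        # Suggested change: only healthy old routes count as available.
        old_available = sum(1 for s in old_statuses if s is RouteStatus.HEALTHY)
    else:
        # Reviewed code: every active old route counts as available.
        old_available = len(old_statuses)
    min_available = desired - max_unavailable
    budget = new_healthy + old_available - min_available
    return max(min(budget, len(old_statuses)), 0)


old = [RouteStatus.HEALTHY, RouteStatus.UNHEALTHY, RouteStatus.UNHEALTHY]
print(can_terminate(old, new_healthy=1, desired=3, max_unavailable=1, healthy_only=False))  # 2
print(can_terminate(old, new_healthy=1, desired=3, max_unavailable=1, healthy_only=True))   # 0
```

In this scenario only two routes are actually serving, which already equals `min_available`, so the healthy-only count permits no termination while the original count permits two.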

Copilot · 2026-03-20T06:07:45Z

+        # Fetch non-terminated routes + terminated routes belonging to a
+        # deploying revision.  The FSM needs terminated new-revision routes
+        # to count accumulated failures for rollback detection, but old
+        # terminated routes are irrelevant and would bloat the result set.
+        deploying_revision_ids = {
+            deployment.deploying_revision_id
+            for deployment in deployments
+            if deployment.deploying_revision_id is not None
+        }
+        route_conditions: list[QueryCondition] = [
+            RouteConditions.by_endpoint_ids(endpoint_ids),
+        ]
+        if deploying_revision_ids:
+            route_conditions.append(
+                combine_conditions_or([
+                    RouteConditions.exclude_statuses([RouteStatus.TERMINATED]),
+                    RouteConditions.by_revision_ids(deploying_revision_ids),
+                ])
+            )


The comment and query logic here say terminated new-revision routes are needed for "rollback detection", but RollingUpdateStrategy currently never uses new_failed_count for any decision (it only logs it) and the tests/docs state rollback is handled by coordinator timeout. If rollback detection is no longer part of the FSM, consider simplifying the route query back to excluding TERMINATED routes (or updating the comment to reflect the real reason for including terminated routes) to avoid extra result-set bloat and confusion.
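
The effect of the combined condition in the diff above can be modelled as an in-memory predicate. This is only an illustration: `Route` and `is_fetched` are hypothetical, the endpoint filter is omitted, and statuses are plain strings rather than the real `RouteStatus` enum.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Route:
    """Hypothetical route row reduced to the two fields the filter uses."""

    revision_id: UUID
    status: str


def is_fetched(route: Route, deploying_revision_ids: set[UUID]) -> bool:
    if not deploying_revision_ids:
        # Mirrors the diff: the OR condition is only appended when the set is non-empty.
        return True
    # exclude_statuses([TERMINATED]) OR by_revision_ids(deploying_revision_ids)
    return route.status != "TERMINATED" or route.revision_id in deploying_revision_ids


old_rev, new_rev = uuid4(), uuid4()
routes = [
    Route(old_rev, "HEALTHY"),
    Route(old_rev, "TERMINATED"),  # dropped: old and terminated
    Route(new_rev, "TERMINATED"),  # kept: belongs to the deploying revision
]
print([is_fetched(r, {new_rev}) for r in routes])  # [True, False, True]
```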

Copilot · 2026-03-20T06:07:45Z

+    def test_only_failed_new_no_old_rolls_back(self) -> None:
+        """Only failed new routes, no old → PROVISIONING (retries creation)."""
+        deployment = make_deployment(desired=2)
+        spec = RollingUpdateSpec(max_surge=1, max_unavailable=0)
+        routes = [
+            make_route(revision_id=NEW_REV, status=RouteStatus.FAILED_TO_START),
+            make_route(revision_id=NEW_REV, status=RouteStatus.FAILED_TO_START),
+        ]
+
+        result = RollingUpdateStrategy(spec).evaluate_cycle(deployment, routes)
+
+        assert result.sub_step == DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING
+


The test name test_only_failed_new_no_old_rolls_back is misleading: the assertion expects DEPLOYING_PROVISIONING, and the docstring also says it stays in PROVISIONING. Consider renaming the test (and/or updating the docstring) to reflect the actual behavior (retry/wait rather than rollback).

Copilot · 2026-03-20T06:07:46Z

 @pytest.fixture
-def mixed_summary() -> tuple[StrategyEvaluationSummary, UUID, UUID]:
+def rolled_back_summary() -> tuple[StrategyEvaluationSummary, set[UUID]]:
+    ep_id = uuid4()
+    summary = _build_summary({ep_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING})
+    return summary, {ep_id}
+
+
+@pytest.fixture
+def mixed_summary() -> tuple[StrategyEvaluationSummary, UUID, UUID, UUID]:
    provisioning_id = uuid4()
    completed_id = uuid4()
+    rolled_back_id = uuid4()
    summary = _build_summary(
        {
-            provisioning_id: DeploymentSubStep.PROVISIONING,
-            completed_id: DeploymentSubStep.COMPLETED,
+            provisioning_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING,
+            completed_id: DeploymentLifecycleSubStep.DEPLOYING_COMPLETED,
+            rolled_back_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING,
        },
        route_changes=RouteChanges(
            rollout_specs=[MagicMock()],
            drain_route_ids=[uuid4()],
        ),
    )
-    return summary, provisioning_id, completed_id
+    return summary, provisioning_id, completed_id, rolled_back_id


rolled_back_summary fixture is both unused in this test module and misleadingly named (it assigns DEPLOYING_PROVISIONING). Consider removing it (and the extra rolled_back_id in mixed_summary if it isn't needed) or renaming it to match the actual sub_step being tested to keep the applier tests focused and clear.

…rolling update PR

Move non-rolling-update-evaluator changes to the base refactoring PR:
- Coordinator: sub_step filtering, expired transition for skipped deployments
- Deploying handlers: expired transition, rolling_back post_process
- Executor: route creation refactoring
- Repository/DB source: sub_steps filter parameter
- Strategy applier: remove clear_deploying_revision (moved to repo)
- Strategy types: docstring updates
- BEP-1049 proposal updates
- Test fixtures updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…sses/failures/skipped)

Replace the 4-field result (successes, errors, skipped, need_retry) with the 3-field pattern used by session coordinator: successes, failures, skipped.

Handlers now report all non-success outcomes as failures (DeploymentExecutionError). The coordinator classifies failures into need_retry/expired/give_up based on retry count and timeout policy, matching the session side's approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot assigned jopemachine Mar 12, 2026

github-actions Bot added the size:XL (500~ LoC) and comp:manager (Related to Manager component) labels Mar 12, 2026

jopemachine added this to the 26.3 milestone Mar 12, 2026

jopemachine force-pushed the BA-3435_3 branch from bce97bc to 34cb535 Compare March 13, 2026 03:38

github-actions Bot added the comp:common (Related to Common component) and require:db-migration (Automatically set when alembic migrations are added or updated) labels Mar 16, 2026

jopemachine modified the milestones: 26.3, 26.4 Mar 17, 2026

jopemachine force-pushed the BA-3435_3 branch 2 times, most recently from d7a7761 to 10ddb6b Compare March 18, 2026 05:21

jopemachine removed the require:db-migration (Automatically set when alembic migrations are added or updated) label Mar 18, 2026

jopemachine force-pushed the BA-3435_3 branch 3 times, most recently from 5939e01 to d296163 Compare March 19, 2026 09:16

jopemachine marked this pull request as ready for review March 20, 2026 06:02

jopemachine requested review from a team, HyeockJinKim and Copilot March 20, 2026 06:02

Copilot started reviewing on behalf of jopemachine March 20, 2026 06:03

Copilot AI reviewed Mar 20, 2026

jopemachine force-pushed the BA-3435_3 branch from 2e09959 to 7f1a508 Compare March 20, 2026 06:56

jopemachine changed the base branch from main to refactor/flatten-deployment-lifecycle-sub-step March 20, 2026 07:09

jopemachine force-pushed the BA-3435_3 branch 7 times, most recently from 0d5fc37 to b94732d Compare March 20, 2026 09:38

Base automatically changed from refactor/flatten-deployment-lifecycle-sub-step to main March 23, 2026 01:52

jopemachine force-pushed the BA-3435_3 branch from b94732d to 327b5e5 Compare March 23, 2026 01:57

HyeockJinKim reviewed Mar 23, 2026

Comment thread src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py Outdated

HyeockJinKim reviewed Mar 23, 2026

Comment thread src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py

jopemachine requested a review from HyeockJinKim March 23, 2026 04:01

jopemachine force-pushed the BA-3435_3 branch 2 times, most recently from de3fb63 to 89391b8 Compare March 23, 2026 05:56

github-actions Bot added the area:docs Documentations label Mar 23, 2026

jopemachine force-pushed the BA-3435_3 branch from 827e774 to 451d3d5 Compare March 23, 2026 07:32

jopemachine and others added 15 commits March 23, 2026 16:39

feat(BA-3435): Implement Rolling Update deployment strategy

a3ca018

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wip

02f882e

wipp

9b0733a

wip

f4eb3d4

wip

2119e32

wip

cbbb215

wip

45755f0

fix: Apply formatting to coordinator after rebase

e537e51

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Remove unused resolve_sub_step from HandlerRegistry

f4e10a4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bep

85383d3

feat: Make RollingUpdateSpec evaluate_cycle's argument

963f91e

wip

5078fe5

wip

d89b051

jopemachine force-pushed the BA-3435_3 branch from 451d3d5 to d89b051 Compare March 23, 2026 07:40

HyeockJinKim approved these changes Mar 23, 2026

HyeockJinKim enabled auto-merge (squash) March 23, 2026 07:41

HyeockJinKim merged commit a8dfe30 into main Mar 23, 2026
32 of 33 checks passed

HyeockJinKim deleted the BA-3435_3 branch March 23, 2026 07:50
